Latent Class Analysis

Daniel Martin, Kameron Standhaft, Esra Ari Acar

Overview

  • What is Latent Class Analysis (LCA)?
  • Usage of LCA
  • Methods
  • Our dataset
  • Our goal
  • Best number of classes
  • Analyze class attributes
  • Sensitivity analysis
  • Conclusion

What is Latent Class Analysis (LCA)?

  • Latent Class Analysis (LCA) is a probabilistic method of unsupervised clustering that can be utilized when it is believed that there may be unobserved subgroups (classes) among the individuals within a population. (Nylund-Gibson and Choi 2018)

  • The central assumption of the LCA model is the presence of latent classes within a population. (Weller, Bowen, and Faubert 2020)

  • LCA is a type of finite mixture model (FMM), a statistical approach used in unsupervised learning. (Grimm, Houpt, and Rodgers 2021)

  • LCA is a useful model for identifying subgroups within a population based on patterns in categorical variables.

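  • LCA assumes local independence: within each class the observed variables are unrelated to one another, so any association among them in the whole population is attributed to class membership.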

Figure 1 - 2 Class Model Representation for LCA (Sinha, Calfee, and Delucchi 2021)

Advantages:

  • Identifying hidden subgroups

  • Handling categorical data

  • Providing probability estimates

  • Handling missing data

  • Model selection

Drawbacks:

  • LCA is computationally demanding, which limits the number of variables that can be included in the analysis.

  • It may be challenging to determine the appropriate number of latent classes.

Usage of LCA

LCA has been used in various fields, such as psychology, sociology, public health, and business and marketing research.

  • Case 1: Applying LCA to identify subgroups of children with similar patterns of mental health symptoms.

The Application of Latent Class Analysis for Investigating Population Child Mental Health (Petersen, Qualter, and Humphrey 2019)

  • Case 2: Improving marketing research using LCA models.

Latent Class Analysis for Marketing Scale Development (Bassi 2011)

Methods

The standard equation (Naldi and Cazzaniga 2020) for the LCA model is:

\[ p(x_i)=\sum_{k = 1}^{K}{p_k}\prod_{n = 1}^{N}{p_n(x_{in}|k)},\]

  • \(p(x_i)\): the probability of observing a particular combination of responses in a group of \(N\) variables

  • \(p_k\): the probability of membership in latent class \(k\)

  • \(p_n(x_{in}|k)\): the probability of response \(x_{in}\) to variable \(n\), conditional on membership in latent class \(k\)

An LCA model has two sets of parameters:

  • Inclusion probabilities (\(p_k\))

  • Conditional probabilities (\(p_n(x_{in}|k)\))
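
As a hedged illustration of the equation above, the following base R sketch evaluates \(p(x_i)\) for a toy model with \(K = 2\) classes and \(N = 3\) binary variables. The function name pattern_prob and all numeric values are assumptions made up for this example; they do not come from our dataset.

```r
# Toy 2-class model with 3 binary variables (all numbers are illustrative assumptions)
p_k <- c(0.6, 0.4)                      # inclusion probabilities, one per class

# p_yes[k, n]: probability that variable n equals 1, given membership in class k
p_yes <- rbind(c(0.9, 0.8, 0.1),
               c(0.2, 0.3, 0.7))

# Probability of one response pattern x (a 0/1 vector) under the LCA model
pattern_prob <- function(x, p_k, p_yes) {
  # For each class, multiply the conditional probabilities across variables
  within_class <- apply(p_yes, 1, function(p) prod(ifelse(x == 1, p, 1 - p)))
  # Weight each class by its inclusion probability and sum over classes
  sum(p_k * within_class)
}

x <- c(1, 1, 0)
pattern_prob(x, p_k, p_yes)
```

Summing pattern_prob() over all \(2^3 = 8\) possible response patterns should give 1, which is a quick sanity check on any set of parameters.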

The general steps to use latent class analysis are as follows:

  • Identify the research question and define the variables

  • Select the appropriate software

  • Determine the number of latent classes for an ideal model

  • Estimate the model: Estimate the model parameters using maximum likelihood estimation or Bayesian estimation

  • Evaluate the model: Assess goodness-of-fit statistics such as the likelihood ratio test, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), entropy, and the chi-square goodness-of-fit statistic (\(\chi^2\))

  • Conduct sensitivity analysis: Run sensitivity analyses (e.g., a correlation test) to check the robustness of the results and to evaluate the stability of the estimated probabilities across different subgroups or samples.
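
To make the evaluation step more concrete, here is a minimal base R sketch of how AIC and BIC are computed from a model's maximized log-likelihood. The helper names lca_aic and lca_bic, the log-likelihood values, and the sample size are hypothetical; only the two formulas are standard.

```r
# Information criteria from a fitted model's maximized log-likelihood.
# llik: log-likelihood, npar: number of free parameters, n: number of observations
lca_aic <- function(llik, npar) -2 * llik + 2 * npar
lca_bic <- function(llik, npar, n) -2 * llik + npar * log(n)

# Hypothetical values for two competing models fitted to the same n = 200 cases
fits <- data.frame(model = c("2-class", "3-class"),
                   llik  = c(-1050, -1020),
                   npar  = c(11, 17))
fits$AIC <- lca_aic(fits$llik, fits$npar)
fits$BIC <- lca_bic(fits$llik, fits$npar, n = 200)
fits
```

For both criteria, lower values indicate a better trade-off between fit and complexity; BIC penalizes extra parameters more heavily than AIC once the sample has more than about 7 observations.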

Our dataset

To assess the effectiveness of Latent Class Analysis, we used the ‘Zoo Animals’ dataset from Kaggle. This dataset has 101 entries and is well suited to LCA because most of its variables are binary.

Dropped

We dropped two variables (Catsize and Domestic) because we considered them subjective, and we dropped the ‘Animal’ variable because it is an identifier.

Changed

One variable in our dataset took more than two values, so we recoded it as the binary variable ‘has_legs’.
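
The recoding code itself is not shown here, so the following is only a sketch of how such a change might look in base R. It assumes the multi-valued variable is a leg count stored in a column called legs and that the cutoff is "any legs at all"; both are assumptions, as are the example values.

```r
# Hypothetical leg counts standing in for the multi-valued column
legs <- c(4, 2, 0, 6, 8, 0)

# Recode to a 0/1 indicator: 1 if the animal has any legs, 0 otherwise
has_legs <- as.integer(legs > 0)

# poLCA expects indicators coded as positive integers starting at 1,
# so a 0/1 column can be shifted to 1/2 before fitting
has_legs_coded <- has_legs + 1L
has_legs_coded
```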



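To perform our analysis, we use the remaining 14 binary variables: hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, tail, and has_legs.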

Our goal

We do not include the ‘type’ variable in our analysis. Instead, we use its categories (mammal, bird, reptile, fish, insect, amphibian, invertebrate) as a reference against which to compare the classes found by our Latent Class Analysis.

Best number of classes

To determine the number of classes that best models our dataset, we fitted a series of models, beginning with 2 classes and ending with 7 classes, the number of animal types in the dataset. The code below was used to fit each model, changing only the number of classes.

library(poLCA)

# Fit the 2-class model; lca_bind is the model formula and new_zoo_int_1 the recoded data
# nrep = 100 repeats the estimation from different starting values and keeps the best fit
lca_fit2 <- poLCA(lca_bind, data = new_zoo_int_1,
                  nclass = 2, graphs = FALSE, na.rm = TRUE,
                  nrep = 100, maxiter = 100, verbose = FALSE)

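In this call we set the following poLCA options:

  • nclass: the number of latent classes to fit

  • graphs: whether to plot the estimated parameters

  • na.rm: whether to drop entries with missing values

  • nrep: the number of times to estimate the model with different starting values

  • maxiter: the maximum number of iterations of the estimation algorithm

  • verbose: whether to print the results to the screen

This function estimates the “Latent Class Prevalence”, that is, the probability that a dataset entry belongs to each of the model-generated classes (Law and Harrington 2016).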
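
Once prevalences and conditional probabilities are estimated, the membership probabilities for a single entry follow from Bayes' rule. The base R sketch below shows the calculation on a toy 2-class model; all numbers are invented for illustration. In a fitted poLCA object the corresponding quantities are stored in the posterior and predclass elements.

```r
# Posterior class membership for one entry via Bayes' rule
# (toy 2-class model; all numbers are illustrative assumptions)
p_k <- c(0.6, 0.4)                      # class prevalences
p_yes <- rbind(c(0.9, 0.8, 0.1),        # P(variable = 1 | class 1)
               c(0.2, 0.3, 0.7))        # P(variable = 1 | class 2)
x <- c(1, 1, 0)                         # observed 0/1 responses for one entry

# Likelihood of x within each class, under local independence
lik <- apply(p_yes, 1, function(p) prod(ifelse(x == 1, p, 1 - p)))

# Posterior = prevalence * likelihood, normalized to sum to 1
posterior <- p_k * lik / sum(p_k * lik)
posterior

# Modal assignment: the class with the highest posterior probability
which.max(posterior)
```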

LCA 2-Class Model

Model Comparison Statistics

LCA Model AIC & BIC Comparison

LCA Model GoF Comparison
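
When reading these comparisons, it can help to keep model size in view. The sketch below counts the free parameters of an unrestricted LCA with 14 binary indicators for each number of classes we considered. The helper name lca_npar is ours, and the count assumes no parameter restrictions.

```r
# Free parameters in an unrestricted LCA with J binary indicators and K classes:
# (K - 1) class prevalences plus K * J conditional probabilities
lca_npar <- function(K, J) (K - 1) + K * J

K <- 2:7
npar_table <- data.frame(classes = K, npar = lca_npar(K, J = 14))
npar_table
```

Under these assumptions the 7-class model has 104 free parameters, more than the 101 entries in the dataset, which is one reason to treat the larger models with caution.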

Entropy is a measure of the dispersion (or concentration) in a probability mass function (R Core Team 2021).

We use the poLCA.entropy() function to compare our 5-Class and 6-Class models.

library(poLCA)

lca_fit5.ent <- poLCA.entropy(lca_fit5)
lca_fit6.ent <- poLCA.entropy(lca_fit6)
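
To give a feel for the quantity being compared, the following base R sketch computes the Shannon entropy of a small probability mass function. It does not reproduce poLCA.entropy() itself; the helper name pmf_entropy and the example distributions are assumptions for illustration only.

```r
# Shannon entropy of a probability mass function (natural log)
pmf_entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log(0) as 0
  -sum(p * log(p))
}

pmf_entropy(c(0.25, 0.25, 0.25, 0.25))  # evenly spread: equals log(4)
pmf_entropy(c(0.97, 0.01, 0.01, 0.01))  # concentrated: much smaller
pmf_entropy(c(1, 0, 0, 0))              # fully concentrated: 0
```

The more evenly the probability is spread, the higher the entropy; a distribution concentrated on a single outcome has entropy 0.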

LCA 5-Class Model

Analyze Class Attributes

Class Population Proportion Distribution

# Reorder the classes from highest to lowest proportion for easier comparison and labeling
probs.start.new <- poLCA.reorder(lca_fit5$probs.start,
                                 order(lca_fit5$P, decreasing = TRUE))

# Refit with the reordered starting values
# (with nrep > 1, poLCA uses probs.start only for the first attempt)
lca_fit5 <- poLCA(lca_bind, data = new_zoo_int_1, nclass = 5,
                  graphs = FALSE, na.rm = TRUE,
                  verbose = FALSE, nrep = 100, maxiter = 100,
                  probs.start = probs.start.new)

orig_classes <- data.frame( 
   Class = c('Class 1', 'Class 2', 'Class 3', 'Class 4', 'Class 5'),
   Percentage = c('35.46%', '20.78%','17.82%','13.09%','12.85%'))

Class 1, the largest class, is estimated to contain ~35.46% of the population. The model predicts its members to have ‘hair’, ‘toothed’, ‘backbone’, ‘breathes’, ‘tail’, and ‘has_legs’. We filtered the cleaned new_zoo dataset for these attributes and found that all members were classified as type ‘mammal’ in the original dataset.

Class 1 (view displays a subset of population)
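
The filtering step used for each class can be sketched in base R as follows. The data frame toy_zoo is a hypothetical miniature, not the real new_zoo data, and the attribute list is shortened for readability.

```r
# Hypothetical miniature of the cleaned data (names and values are illustrative only)
toy_zoo <- data.frame(
  animal   = c("a", "b", "c", "d"),
  hair     = c(1, 1, 0, 0),
  backbone = c(1, 1, 1, 0),
  has_legs = c(1, 0, 1, 1),
  type     = c("mammal", "mammal", "bird", "insect")
)

# Attributes the class is predicted to have
class_attrs <- c("hair", "backbone", "has_legs")

# Keep rows where every listed attribute equals 1
keep <- rowSums(toy_zoo[, class_attrs] == 1) == length(class_attrs)
members <- toy_zoo[keep, ]

# Cross-check the selected rows against the reference labels
table(members$type)
```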

Class 2 is estimated to contain ~20.78% of the population. The model predicts its members to have ‘feathers’, ‘eggs’, ‘backbone’, ‘breathes’, ‘tail’, and ‘has_legs’. We filtered the cleaned new_zoo dataset for these attributes and found that all members were classified as type ‘bird’ in the original dataset.

Class 2 (view displays a subset of population)

Class 3 is estimated to contain ~17.82% of the population. Two of its attributes (‘breathes’ and ‘predator’) have probabilities over 50% on the graph, but because we restricted our initial filtering to attributes at ~80% and above, they are not included in the filter.

These class members are predicted to have ‘eggs’ and ‘has_legs’. This limited selection duplicates a few members already selected for other classes, but the class also contains the invertebrates and insects, which were not present in the other classes. Judged solely against the type attribute in the original dataset, this class appears to be the least homogeneous.

Class 3 (view displays a subset of population)

Class 4, which is estimated to contain ~13.09% of the population, is less homogeneous than Classes 1 and 2 but not as heterogeneous as Class 3. The original classification schema of “type” has 7 unique values, so with a 5-class model some overlap is inevitable.

This overlap is visible only because we have the original “type” data for illustration, which is unlikely to be the case with research data. This class’s members are predicted to have ‘aquatic’, ‘predator’, ‘toothed’, ‘backbone’, ‘breathes’, and ‘tail’. We filtered the cleaned new_zoo dataset for these attributes. The filter yielded only 5 members, which are mostly aquatic mammals plus a single amphibian.

Class 4 (view displays a subset of population)

Class 5, the smallest class, is estimated to contain ~12.85% of the population. The model predicts its members to have ‘eggs’, ‘aquatic’, ‘toothed’, ‘backbone’, ‘fins’, and ‘tail’. We filtered the cleaned new_zoo dataset for these attributes and found that all members were classified as type ‘fish’ in the original dataset.

Class 5 (view displays a subset of population)

LCA 5-Class Model Biplot

Sensitivity Analysis

As seen in the correlation heatmap figure below, ‘milk’ is highly positively correlated with ‘hair’ and highly negatively correlated with ‘eggs’. We decided to remove ‘milk’ as an attribute and refit our LCA model to see whether it improves without this highly collinear variable.

Correlation heatmap for sensitivity analysis
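
The kind of check behind such a heatmap can be sketched in base R. The three indicator vectors below are hypothetical stand-ins for the real columns, and the 0.7 cutoff is an arbitrary assumption rather than a threshold we used.

```r
# Hypothetical 0/1 indicators for eight animals (values are illustrative only)
milk <- c(1, 1, 1, 0, 0, 0, 0, 1)
hair <- c(1, 1, 1, 0, 0, 0, 1, 1)
eggs <- c(0, 0, 0, 1, 1, 1, 1, 0)

toy <- data.frame(milk, hair, eggs)

# Pearson correlation on 0/1 variables equals the phi coefficient
corr <- cor(toy)
round(corr, 2)

# List variable pairs whose absolute correlation exceeds a chosen cutoff
high <- which(abs(corr) > 0.7 & upper.tri(corr), arr.ind = TRUE)
data.frame(var1 = rownames(corr)[high[, "row"]],
           var2 = colnames(corr)[high[, "col"]],
           r    = corr[high])
```

Strongly correlated indicators strain the local independence assumption of LCA, which is why such pairs are candidates for removal.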

library(dplyr)

# Create a dataset without 'milk' (13 variables)
new_zoo_int_2 <- new_zoo_int_1 %>% mutate(milk = NULL)

#Binding 13-variables into columns for modified LCA model without 'milk'
lca_bind_corr <-  cbind(hair, feathers, eggs, airborne, aquatic, predator, 
                   toothed, backbone, breathes,venomous, fins, tail, has_legs) ~ 1

lca_fit5_corr <- poLCA(lca_bind_corr, data = new_zoo_int_2, 
                  nclass = 5, graphs = FALSE, na.rm = TRUE,
                  verbose = FALSE, nrep=100, maxiter=100)
lca_fit5_corr.ent <- poLCA.entropy(lca_fit5_corr)

#Creating data frame of AIC, BIC, GoF, and Entropy of both original and modified 5-Class Models
class5_compare <- data.frame(Model = c('Original 5-Class Model', 'Adjusted 5-Class Model'),
   AIC = c(lca_fit5$aic, lca_fit5_corr$aic), BIC = c(lca_fit5$bic, lca_fit5_corr$bic),
   GoF = c(lca_fit5$Chisq, lca_fit5_corr$Chisq), entropy = c(lca_fit5.ent, lca_fit5_corr.ent))

Comparison of 5-Class Models

Given the improvement in all fit statistics except entropy for the 5-class model with the highly correlated variable ‘milk’ removed, we decided to remove ‘milk’ from our final model, leaving 13 attributes for classification by Latent Class Analysis.

Class 3 and Class 4 show the greatest changes in population proportion compared with the 5-class model that includes ‘milk’. Class 3 sees about a 1% drop in membership, and Class 4 sees a ~1.4% increase. Because these two classes were our least homogeneous, it makes sense that they would see the greatest changes in proportion.

5-Class LCA Model Population Percentages without ‘milk’

Conclusion

Our research indicates that Latent Class Analysis (LCA) is an exceptionally effective tool for classifying observations based on their categorical attributes. With the dataset-defined “type” classification as a reference, our 5-class model shows that LCA can identify relevant classes that are not otherwise defined in the collected data, even with our relatively small dataset (101 entries).

Despite its modeling power, LCA poses some challenges that must be accounted for by the researcher. As we have seen, it requires substantial testing of fit metrics to establish the best number of classes for the model (Asparouhov and Muthén 2014). There are also interpretation challenges once the best LCA model has been established (Weller, Bowen, and Faubert 2020).

With larger datasets and proper screening of models, LCA is a powerful tool for finding commonalities among members of a population, particularly in disciplines where research often relies on self-reporting and the assessment of softer-science categories, such as opinion research, behavioral observation, and medical research (Kaplan 2004).

References

Asparouhov, Tihomir, and Bengt Muthén. 2014. “Auxiliary Variables in Mixture Modeling: Three-Step Approaches Using Mplus.” Structural Equation Modeling: A Multidisciplinary Journal 21 (3): 329–41.
Bassi, Francesca. 2011. “Latent Class Analysis for Marketing Scale Development.” International Journal of Market Research 53 (2): 209–30.
Grimm, Kevin J, Russell Houpt, and Danielle Rodgers. 2021. “Model Fit and Comparison in Finite Mixture Models: A Review and a Novel Approach.” In Frontiers in Education, 6:613645. Frontiers Media SA.
Kaplan, David, ed. 2004. The SAGE Handbook of Quantitative Methodology for the Social Sciences. SAGE Publications, Inc. https://methods.sagepub.com/book/the-sage-handbook-of-quantitative-methodology-for-the-social-sciences.
Law, Ernest H, and Rachel Harrington. 2016. “A Primer on Latent Class Analysis.” Value & Outcomes Spotlight 2 (6): 18–19.
Naldi, Luigi, and Simone Cazzaniga. 2020. “Research Techniques Made Simple: Latent Class Analysis.” Journal of Investigative Dermatology 140 (9): 1676–80.
Nylund-Gibson, Karen, and Andrew Young Choi. 2018. “Ten Frequently Asked Questions about Latent Class Analysis.” Translational Issues in Psychological Science 4 (4): 440.
Petersen, Kimberly J, Pamela Qualter, and Neil Humphrey. 2019. “The Application of Latent Class Analysis for Investigating Population Child Mental Health: A Systematic Review.” Frontiers in Psychology 10: 1214.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sinha, Pratik, Carolyn S Calfee, and Kevin L Delucchi. 2021. “Practitioner’s Guide to Latent Class Analysis: Methodological Considerations and Common Pitfalls.” Critical Care Medicine 49 (1): e63.
Weller, Bridget E, Natasha K Bowen, and Sarah J Faubert. 2020. “Latent Class Analysis: A Guide to Best Practice.” Journal of Black Psychology 46 (4): 287–311.